21 research outputs found

    A model of Poissonian interactions and detection of dependence

    This paper proposes a model of interactions between two point processes, governed by a reproduction function h that is regarded as the intensity of a Poisson process. In particular, we focus on the neuroscience context of detecting possible interactions in the cerebral activity associated with two neurons. To give this problem of neurobiologists a mathematical formulation, we address the question of testing whether the intensity h is null. We construct a multiple testing procedure obtained by aggregating single tests based on a wavelet thresholding method. This test has good theoretical properties: its level can be guaranteed, its power can be controlled under some assumptions, and its uniform separation rate over weak Besov bodies is adaptive minimax. Simulations illustrate the good practical behavior of the testing procedure.
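As a rough illustration of the detection problem (testing whether the reproduction function h is null), the sketch below simulates two point processes, one of which optionally "reproduces" points shortly after the points of the other, and runs a simple permutation test on close-pair counts. This is not the paper's wavelet aggregation procedure; all rates, window widths and the test statistic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(T=100.0, lam1=1.0, lam0=0.5, h_height=0.0, h_width=0.02):
    """Parent Poisson process t1, plus a second process t2 made of a
    baseline Poisson process and, when h_height > 0, extra points
    'reproduced' shortly after each parent point."""
    t1 = np.sort(rng.uniform(0.0, T, rng.poisson(lam1 * T)))
    t2 = rng.uniform(0.0, T, rng.poisson(lam0 * T))
    if h_height > 0:
        for t, k in zip(t1, rng.poisson(h_height, size=len(t1))):
            t2 = np.concatenate([t2, t + rng.uniform(0.0, h_width, k)])
    return t1, np.sort(t2)

def close_pairs(t1, t2, delta=0.02):
    """Count pairs with 0 <= t2_j - t1_i < delta (t2 must be sorted)."""
    return int(np.sum(np.searchsorted(t2, t1 + delta) - np.searchsorted(t2, t1)))

def permutation_pvalue(t1, t2, T=100.0, delta=0.02, B=200):
    """Compare the observed close-pair count to counts obtained by
    redrawing the second process uniformly (independence null)."""
    obs = close_pairs(t1, t2, delta)
    null = [close_pairs(t1, np.sort(rng.uniform(0.0, T, len(t2))), delta)
            for _ in range(B)]
    return (1 + sum(c >= obs for c in null)) / (B + 1)

p_dep = permutation_pvalue(*simulate_pair(h_height=2.0))  # h nonzero
p_ind = permutation_pvalue(*simulate_pair(h_height=0.0))  # h null
print(p_dep, p_ind)
```

With a clearly nonzero h the p-value is small, while under the null it behaves like a uniform draw; the paper's procedure achieves this with guaranteed level and adaptive minimax separation rates, which this toy test does not.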

    Variable selection using Random Forests

    This paper investigates two classical issues of variable selection with random forests, the increasingly popular statistical method for classification and regression problems introduced by Leo Breiman in 2001. The first issue is to find important variables for interpretation; the second, more restrictive, is to design a good parsimonious prediction model. The main contribution is twofold: to provide experimental insights into the behavior of the variable importance index based on random forests, and to propose a strategy combining a ranking of explanatory variables by the random forests importance score with a stepwise ascending variable introduction strategy.
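The two-step strategy (importance ranking, then stepwise ascending introduction) can be sketched with scikit-learn. Note this is only a Python analogue of the paper's R procedure: scikit-learn's default `feature_importances_` is impurity-based rather than permutation-based, and the improvement threshold below is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Step 1: rank variables by random-forest importance (impurity-based here)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Step 2: stepwise ascending introduction; keep a variable only if it
# improves cross-validated accuracy by more than a small threshold
selected, best = [], 0.0
for j in order:
    trial = selected + [j]
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X[:, trial], y, cv=5).mean()
    if score > best + 1e-3:
        selected, best = trial, score

print(selected, round(best, 3))
```

The greedy loop typically retains the few informative variables and discards the noise ones, which is exactly the parsimonious-model objective described above.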

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science, with numerous algorithmic and theoretical consequences. Big Data always involves massive data, but often also online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals for scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as the out-of-bag error and variable importance, are handled in these methods. We then formulate various remarks about random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one from real-world data. One variant relies on subsampling, while three others relate to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
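A minimal sketch of the "divide-and-conquer" family of variants, assuming scikit-learn and a toy dataset in place of the massive ones used in the paper: fit one small forest per data chunk, then aggregate the chunk forests by averaging their class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=8, random_state=0)
Xtr, ytr, Xte, yte = X[:2500], y[:2500], X[2500:], y[2500:]

# "Divide-and-conquer": fit one small forest per chunk of the training
# data; in a real Big Data setting each chunk would live on one worker.
chunks = np.array_split(np.arange(len(Xtr)), 5)
forests = [RandomForestClassifier(n_estimators=50, random_state=k)
           .fit(Xtr[idx], ytr[idx]) for k, idx in enumerate(chunks)]

def predict(Xnew):
    # Aggregate the chunk forests by averaging class probabilities
    proba = np.mean([f.predict_proba(Xnew) for f in forests], axis=0)
    return proba.argmax(axis=1)

acc = (predict(Xte) == yte).mean()
print(round(acc, 3))
```

Averaging probabilities makes the union of chunk forests behave like one larger forest whose trees saw different subsamples, which is why this variant scales while staying close to a standard random forest.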

    VSURF: an R package for variable selection using random forests

    This paper describes the R package VSURF. Based on random forests, it delivers two subsets of variables according to two types of variable selection, for classification or regression problems. The first is a subset of important variables relevant for interpretation, while the second is a subset corresponding to a parsimonious prediction model. The strategy is based on a preliminary ranking of the explanatory variables using the random forests permutation-based importance score, followed by a stepwise ascending variable introduction strategy. Both subsets can be obtained automatically using data-driven default values, good enough to provide interesting results, but can also be fine-tuned by the user. The algorithm is illustrated on a simulated example and applications to real datasets are presented.

    Variable selection through CART

    This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure relying on the CART algorithm and on model selection via penalization. This theoretical work aims at determining adequate penalties, i.e. penalties yielding oracle-type inequalities that justify the performance of the proposed procedure. Since the exhaustive procedure cannot be executed when the number of variables is too large, a more practical procedure is also proposed and theoretically validated. A simulation study completes the theoretical results.
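The paper's exhaustive procedure penalizes over collections of variable subsets; as a practical stand-in, the sketch below uses CART's standard cost-complexity pruning (a penalized model selection over subtrees, available in scikit-learn) and reads the selected variables off the pruned tree. The dataset and the cross-validated choice of penalty are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=3,
                           random_state=1)

# Candidate penalties come from CART's cost-complexity pruning path;
# the penalty is then chosen by cross-validation (clip guards against
# tiny negative alphas from floating-point error).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(np.clip(path.ccp_alphas, 0, None)):
    score = cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
        X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

# The selected variables are those the pruned tree actually splits on
# (internal nodes have feature >= 0, leaves are marked with -2)
tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
selected = sorted(set(tree.tree_.feature[tree.tree_.feature >= 0]))
print(selected, round(best_score, 3))
```

The penalty trades tree size against fit, so the pruned tree, and hence the selected variable set, is small when few variables carry signal.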

    Inference of functional connectivity in Neurosciences via Hawkes processes

    1st IEEE Global Conference on Signal and Information Processing, 3-5 Dec. 2013, Austin (USA). We use Hawkes processes as models for spike train analysis. A new Lasso method designed for general multivariate counting processes enables us to estimate the functional connectivity graph between the different recorded neurons.
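A crude discrete-time analogue of this idea, assuming binned spike counts and scikit-learn's Lasso in place of the paper's point-process estimator: regress each neuron's counts on the lagged counts of all neurons, and read the connectivity graph off the nonzero coefficients. The three-neuron simulation, bin width, lag and penalty level are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Binned spike counts for three "neurons": neuron 2 is driven by the
# lag-1 activity of neuron 0, while neurons 0 and 1 fire independently.
n_bins, lag = 5000, 1
counts = np.zeros((n_bins, 3))
counts[:, 0] = rng.poisson(0.2, n_bins)
counts[:, 1] = rng.poisson(0.2, n_bins)
counts[:, 2] = rng.poisson(0.05, n_bins)
counts[lag:, 2] += rng.poisson(0.8 * counts[:-lag, 0])

# Lasso regression of each neuron's counts on the lagged counts of all
# neurons; nonzero coefficients define the estimated connectivity graph
# (entry [i, j] = estimated influence of neuron i on neuron j).
X, Y = counts[:-lag], counts[lag:]
adj = np.zeros((3, 3))
for j in range(3):
    adj[:, j] = Lasso(alpha=0.01).fit(X, Y[:, j]).coef_

print(np.round(adj, 2))
```

The Lasso penalty zeroes out the absent edges, so the sparsity pattern of `adj` recovers the true graph (only the 0 -> 2 influence survives); the paper's method does this directly on the continuous-time counting processes.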

    VSURF: An R Package for Variable Selection Using Random Forests

    This paper describes the R package VSURF. Based on random forests, and for both regression and classification problems, it returns two subsets of variables. The first is a subset of important variables, including some redundancy, which can be relevant for interpretation; the second is a smaller subset corresponding to a model that tries to avoid redundancy and focuses more closely on the prediction objective. The two-stage strategy is based on a preliminary ranking of the explanatory variables using the random forests permutation-based importance score, and proceeds with a stepwise forward strategy for variable introduction. Both subsets can be obtained automatically using data-driven default values, good enough to provide interesting results, but can also be tuned by the user. The algorithm is illustrated on a simulated example and applications to real datasets are presented.